-
Notifications
You must be signed in to change notification settings - Fork 135
fix(guard): handle actor stopped during ws/req #3481
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(guard): handle actor stopped during ws/req #3481
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
2 Skipped Deployments
|
|
Warning This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
How to use the Graphite Merge QueueAdd the label merge-queue to this PR to add it to the merge queue. You must have a Graphite account in order to use the merge queue. Sign up using this link. An organization admin has enabled the Graphite Merge Queue in this repository. Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue. This stack of pull requests is managed by Graphite. Learn more about stacking. |
Pull Request ReviewSummaryThis PR fixes a race condition where actor stops during WebSocket connections or HTTP requests weren't being properly handled, leading to requests hanging indefinitely. The fix adds subscriptions to the ✅ Positive Aspects
🔍 Code Quality Observations1. Duplicate Subscription PatternIn 2. Runner WebSocket Behavior Change (Important)In 3. Comment AccuracyLine 385 comment changed from "Send reclaimed messages" to "Send pending messages" - good improvement in clarity! ✅ 🐛 Potential Issues1. Subscription Timing in Hibernation FlowThe 2. Error Propagation During CleanupWhen sending the WebSocket close message fails, only an error log is produced. Should this error be propagated? If the close message fails to send, the runner might not clean up properly. 🔒 Security ConsiderationsNo security issues identified. The changes properly handle error cases and don't introduce new attack vectors. ✅ ⚡ Performance Considerations
🧪 Test CoverageMajor Gap: No tests are included with this PR for critical race condition fixes. Recommendations:
🎯 Recommendations SummaryMust Address:
Should Consider:
Final VerdictThis is a solid bug fix that addresses a real race condition. The implementation is clean and follows existing patterns. However, the lack of tests for such critical race condition fixes is concerning. Recommended Action: Approve with the requirement that tests be added before merging, especially for the WebSocket-specific behavior changes. Great work on identifying and fixing this race condition! 🚀 |
6c5e080 to
dc9084a
Compare
e83ac9b to
00cd5f2
Compare
PR Review: fix(guard): handle actor stopped during ws/reqSummaryThis PR adds handling for actor stopped events during WebSocket and HTTP request processing in the pegboard gateway, ensuring graceful cleanup when actors stop mid-request. The changes also improve WebSocket lifecycle management in the runner. Code Quality ✅Strengths:
Detailed Analysis1. pegboard-gateway/src/lib.rs - Actor Stopped HandlingLines 155-158, 290-293: Subscription creation let mut stopped_sub = self
.ctx
.subscribe::<pegboard::workflows::actor::Stopped>(("actor_id", self.actor_id))
.await?;✅ Good: Subscribes to actor stopped events before processing requests Lines 220-223, 356-359, 436-444: Select arms for stopped events _ = stopped_sub.next() => {
tracing::debug!("actor stopped while waiting for request response");
return Err(ServiceUnavailable.build());
}✅ Good: Appropriate error types returned for each context 2. pegboard-gateway/src/shared_state.rs - Pending Messages CheckLines 347-357: New pub async fn has_pending_websocket_messages(&self, request_id: RequestId) -> Result<bool> {
let Some(req) = self.in_flight_requests.get_async(&request_id).await else {
bail!("request not in flight");
};
if let Some(hs) = &req.hibernation_state {
Ok(!hs.pending_ws_msgs.is_empty())
} else {
Ok(false)
}
}✅ Good: Simple, focused method Lines 609-616: Early wake optimization in // Immediately rewake if we have pending messages
if self
.shared_state
.has_pending_websocket_messages(unique_request_id.into_bytes())
.await?
{
return Ok(HibernationResult::Continue);
}✅ Excellent: Prevents unnecessary hibernation when messages are pending 3. pegboard-runner/src/lib.rs - WebSocket Lifecycle ChangesLines 246-279: Simplified close message handling // Send close messages to all remaining active requests
let active_requests = conn.tunnel_active_requests.lock().await;
for (request_id, req) in &*active_requests {
// Websockets are not ephemeral like requests. If the runner ws closes they are not informed;
// instead they wait for the actor itself to stop.
if req.is_ws {
continue;
}
// ... send ToServerResponseAbort
}✅ Good: Clear distinction between WebSocket and HTTP request lifecycle Removed Code Analysis:
Potential Issues & SuggestionsMinor Issues
// Send pending messagesPrevious comment said "Send reclaimed messages" - good fix! ✅
// Subscribe before reading body
let mut stopped_sub = self.ctx.subscribe::<...>(...).await?;
let body_bytes = req.into_body().collect().await
.context("failed to read body")?
.to_bytes();
Questions for Discussion
Performance Considerations✅ Good: The early wake check (lines 609-616) prevents unnecessary hibernation overhead Security Concerns✅ No security issues identified
Test Coverage
Recommendations
VerdictApproval: ✅ Approved with Suggestions This is a solid PR that improves reliability by handling actor stopped events gracefully. The code follows existing patterns, has appropriate error handling, and includes a nice optimization for hibernation. The suggestions above are minor improvements that could be addressed in follow-up work. Strengths:
Action Items:
|
Code Review: PR #3481 - fix(guard): handle actor stopped during ws/reqSummaryThis PR adds proper handling for actor stop events during WebSocket and HTTP request processing in the pegboard gateway. The changes ensure graceful cleanup when actors stop while requests are in-flight. Code Quality ✅Strengths:
Potential Issues
|
PR Review: fix(guard): handle actor stopped during ws/reqOverviewThis PR addresses an important issue where the gateway doesn't properly handle actor stop events during WebSocket and HTTP request processing. The implementation subscribes to actor Code Quality & Best Practices✅ Strengths
🔍 Potential Issues1. Logic Error in
|
d34ad3e to
44b3d81
Compare
Code ReviewSummaryThis PR implements proper handling of actor stops during WebSocket and HTTP requests in the pegboard gateway. The changes ensure that when an actor stops, in-flight requests and WebSocket connections are gracefully terminated instead of hanging. Positive Aspects✅ Core Fix is Sound: The addition of stopped_sub subscriptions to detect actor stop events is the correct approach. This properly handles the edge case where an actor stops while requests are being processed. ✅ Consistent Error Handling: The use of ServiceUnavailable for HTTP requests and conditional error handling for WebSockets (hibernate-capable vs non-hibernate) is appropriate. ✅ Optimization: The early check in handle_websocket_hibernation (lines 614-621) to immediately wake if there are pending messages is a good performance optimization. ✅ Proper Visibility Change: Making is_ws_hibernate public allows proper detection of hibernation errors when deciding whether to send close frames. ✅ Lifecycle Cleanup: The runner cleanup logic correctly skips WebSocket close messages when the runner itself disconnects, as WebSockets wait for actor stop events. |
Issues and Concerns🔴 Critical: Logic Error in Close Frame ConditionLines 562-566 in pegboard-gateway/src/lib.rs have an inverted condition: if lifecycle_res
.as_ref()
.map_or_else(is_ws_hibernate, |_| false)
{Problem: map_or_else evaluates the first closure (checking hibernation) when the result is Err, and returns false for Ok values. This means:
Expected: Close frames should be sent to the runner UNLESS we're hibernating. Fix Option 1 - Negate the condition: if \!lifecycle_res.as_ref().map_or_else(is_ws_hibernate, |_| false) {Fix Option 2 - Use clearer logic: if lifecycle_res.as_ref().err().map_or(true, |e| \!is_ws_hibernate(e)) { |
44b3d81 to
d34ad3e
Compare
Medium: Race Condition Risk in Request HandlerLines 155-158 subscribe to actor stop events AFTER reading the request body. If the actor stops between when the request arrives and when we subscribe, we might miss the stop event. Recommendation: Move the subscription to before reading the body (before line 148) to ensure we catch all stop events. Medium: Inconsistent Subscription TimingIn handle_request, the stopped subscription is created early (line 155), but in handle_websocket it is created later (line 290) after header extraction. While this may work due to workflow event replay, it would be more consistent and defensive to create subscriptions as early as possible in both handlers. Minor: Comment AccuracyLine 385 comment changed from "Send reclaimed messages" to "Send pending messages". While "pending" is clearer, verify this change is intentional and accurate for the hibernation recovery flow. Minor: Unrelated Documentation ChangesThe changes to website/public/llms-full.txt and llms.txt appear unrelated to this PR (removing TypeDoc links, adding blog post URL). These should ideally be in a separate commit or PR for clarity. |
Performance and Security ConsiderationsPerformance ✅
Security ✅
Test CoverageNo tests found for pegboard-gateway or pegboard-runner packages. Consider adding integration tests for:
Recommendations
Code QualityOverall code follows repository conventions well:
The core logic is sound, but the inverted condition is a critical bug that must be fixed before merge. |
Code ReviewThis PR adds proper handling for actor stopped events during WebSocket and HTTP request processing in the pegboard gateway. The changes look solid overall, but I have a few observations and suggestions: Positive Aspects
Issues and Concerns1. Logic Error in Close Message Condition
|
274d094 to
d44f1d1
Compare
d34ad3e to
44b3d81
Compare
Code ReviewSummaryThis PR fixes a critical issue where actors could be stopped during WebSocket or HTTP request handling, potentially leaving clients hanging without proper error responses. The changes introduce proper actor lifecycle monitoring and graceful handling when actors are stopped mid-request. ✅ Positive Aspects
🔍 Issues & Concerns1. Logic Error in Close Frame Sending (
|
44b3d81 to
d34ad3e
Compare
Code Review - PR #3481SummaryThis PR fixes a critical race condition where actors could be stopped during WebSocket or HTTP request handling, leading to requests hanging indefinitely. The solution subscribes to the actor::Stopped event and returns appropriate errors when the actor stops mid-request. Positive Observations
Critical IssuesIssue 1: Logic Error in Close Frame Condition (line 562-566) The condition appears inverted. The comment says send close frame if NOT hibernating, but the code will ONLY send close frames when there IS a hibernation error. Current: if lifecycle_res.as_ref().map_or_else(is_ws_hibernate, |_| false) This evaluates to true only when there is an error AND it is a hibernate error, which is backwards. Issue 2: Missing Stopped Subscription During Hibernation (line 673-705) When a WebSocket is hibernating, the code listens for actor::Ready events but not actor::Stopped events. If the actor stops during hibernation, the WebSocket will hang indefinitely until the client disconnects. Recommendations
Overall RatingNeeds Revision - The core fix is excellent, but issues 1 and 2 are critical bugs that could cause incorrect behavior in production. |
274d094 to
d44f1d1
Compare
d34ad3e to
44b3d81
Compare
Code ReviewThis PR effectively handles the case where an actor stops during WebSocket or HTTP request processing. The implementation is solid overall, with good error handling and proper subscription management. Here are my findings: ✅ Strengths
🔍 Potential Issues
🎯 Best Practices
🚀 Performance Considerations
🔒 SecurityNo security concerns identified. The changes maintain proper error boundaries and don't expose internal state. 🧪 Test CoverageConcern: No tests were added for this new functionality. Consider adding integration tests that:
📝 Minor Suggestions
SummaryRecommendation: Approve with minor suggestions The core functionality is correct and handles the critical race condition properly. The main concern is test coverage, and there's a minor readability improvement opportunity in the close frame condition. The changes follow the codebase conventions well and integrate cleanly with the existing error handling patterns. |
Merge activity
|

No description provided.